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In the Specification 

Please replace the paragraph at page 1, lines 4 through 12 with the following paragraph: 
Most eukaryotic cells have an operational center called the nucleus which contains 
structures called chromosomes. Chemically, chromosomes are formed of deoxyribonucleic acid 
(DNA) and associated protein molecules. Each chromosome may have tens of thousands of 
genes. Some genes are referred to as "encoding" (or carrying information for constructing) 
proteins which are essential in the structuring, functioning and regulating of cells, tissues and 
organs. Thus, for each organism, the components of the DNA molecules encode much of the 
information necessary for creating and maintaining life of the organism. See Human Genome 
Program, U.S. Department of Energy, "Primer on Molecular Genetics", Washington, D.C., 1992. 



Please replace the paragraph at page 2, lines 9 through 16 with the following paragraph: 
As previously mentioned, each DNA molecule contains many genes. A gene is a specific 
sequence of nucleotide bases. These sequences carry the information required for constructing 
m \ J / proteins. A protein is a large molecule formed of one or more chains of amino acids in a specific 
order. Order is determined by base sequence of nucleotides in the gene coding for the protein. 
Each protein has a well-defined functionality. A DNA sequence consists of many biologically 
distinct regions. For the purpose of this application, Applicants distinguish between intergenic 
DNA and genes. Many of the genes in mammalian cells are "split genes". A split gene consists 
of coding and non-coding sequences. The coding sequences in a gene are contained within 
exonic regions (exons), that appear sequentially separated by long regions referred to as introns. 

Please replace the paragraph at page 3, lines 3 through 25 with the following paragraph: 



j 3 



Gene identification and gene discovery in newly sequenced genomic sequences is one of 
the most timely computational questions addressed by bioinformatics scientists. Popular gene 
finding systems include Glimmer, Genmark, Genscan, Genie, GENEWISE, and Grail (See 
Burge, C. and S. Karlin, "Prediction of complete gene structures in human genomic DNA," J. 
Mol Biol, 2(55:78-94, 1997; Salzberg, S. et al., "Microbial gene identification using interpolated 
Markov models," Nucl Acids Res., 2(5^:544-548, 1998; Xu, Y. et al., "Grail: A multi-agent 



neural network system for gene identification," Proc. of the IEEE, 84(10):\544-\552, 1996; 
Kulp, D. et al., "A generalized hidden Markov model for the recognition of human genes in 
DNA," in ISMB-96: Proc. Fourth Intl. Conf Intelligent Systems for Molecular Biology, pp. 134- 
141, Menlo Park, CA, 1996, AAAI Press; Borodovsky, M. and J.D. Mclninch, "Genemark: 
Parallel gene recognition for both DNA strands," Computers & Chemistry, 770:123-133, 1993; 
and Salzberg, S. et al. eds., Computational Methods in Molecular Biology, Vol. 32 of New 
Comprehensive Biochemistry, Elsevier Science B.V., Amsterdam, 1998). The annotations 
produced by gene finding systems have been made available to the public. Such projects include 
the genomes of over thirty microbial organisms, as well as Malaria, Drosophila, C.elegans, 
mouse, Human and others. For instance, Glimmer has been widely used in the analysis of many 
microbial genomes and has reported over 98% accuracy in prediction accuracy (See Fraser, CM. 
et al., "Genomic sequence of a Lyme disease spirochaete, Borrelia burgdorferi," Nature 
390(6660):SS0-586, December 1997). Genie (D. Kulp et al. above) has been deployed in the 
^ analysis of the Drosophila genome, and Genscan (C. Burge and S. Karlin above) was used for 
analysis of human chromosome 22. 



Please replace the paragraph at page 4, lines 13 through 23 with the following paragraph: 



On a very high level, genes in human DNA and many other organisms have a relatively 
simple structure. All eukaryotic genes, including human genes, are thought to share a similar 
layout. This layout adheres to the following "grammar" or pattern: start codon, exon, (intron- 
exon) n , stop codon. The start codon is a specific 3-base sequence (e.g. ATG) which signals the 
beginning of the gene. Exons contain the actual genetic material that code for proteins as 
mentioned above. Introns are the spacer segments of DNA whose function is not clearly 
understood. Finally, stop codons (e.g., TAA) signal the end of the gene. The notation (intron- 
exon) n simply means that there are n alternating intron-exon segments. Genes identification 
procedures has to take into account other important issues such as polyA tail, promoters, pseudo- 
genes, alternative splicing and other features. 



Please replace the paragraphs at page 5, lines 3 through 7and lines 8 through 16 with the 
following paragraphs: _____ — ■ 

General theory for the combination of experts has drawn significant interest in the 
machine learning community. Theory and practice of combining experts have been studied in 
literature. The choice of a particular way of combining expert predictions depends on the 
properties of individual experts and the demands posed by the problem at hand. 

Most techniques for combining gene predictions proposed in the past have been rather 
trivial or have relied on ad hoc combinations of experts. In one prior project, Murakami and 
Takagi (Murakami, K. and T. Takagi, "Gene recognition by combination of several gene-finding 
programs," Bioinformatics, 14(8):665-615, 1998) proposed a system for gene recognition that 
combines several gene-finding programs. They implemented an AND and OR combination, 
HIGHEST-method (best individual expert), RULE-method (decisions using sets of expert rules), 
and an ad hoc BOUND ARY-method. The best of these methods achieved an improvement in 
general accuracy of 3%-5% over the individual gene finders. 



Replace the paragraph at page 6, lines 8 through 1 1 with the below paragraph: 

In accordance with another aspect of the present invention, the preliminarily predicted 

gene locations and/or predicted genes include exon (or coding regions) predictions. 

Alternatively, the gene locations for predicted genes are indicated by exons and introns (i.e., 

coding and non-coding regions) of the subject genome sequence. 



Please replace the paragraph at page 7, lines 3 through 9 with the following paragraph: 



By way of background, ligated exons form a sequence that is translated into a protein by a 
simple but still computationally mysterious mechanism of splicing that takes place after the DNA 
sequence has been transcribed into RNA. The process starts by spliceome proteins that recognize 
the splice signals, followed by a step where the introns are cut out (spliced out), and ending in a 
phase where the consecutive exons are "glued" together into a single sequence that is translated 
into a protein. Intuitively speaking this process is performed on an RNA "image" of the genomic 
sequence. 



Please replace the paragraphs at page 8, lines 1 through 5 and lines 6 through 8 with the 
following paragraphs: . . . 

The present invention is a system for the combination of individual experts which is 
learned from data. Unlike the prior art, such a system exploits learned dependencies between 
experts and forms a prediction maximally consistent with known gene data. Statistically, 
predictions of the invention system have the potential to refine the boundaries and verify the 
predictions made by experts. 

An attractive methodology that exploits the joint statistics of expert systems and avoids 
the shortcomings of the prior art is based on Bayesian networks. 



Please replace the paragraph at page 9, lines 3 through 15 with the following paragraph: 



Referring back to Fig. 1 A, the Bayesian network combiner 17 is trained using known 
DNA sequences with known genes (exons/introns) as the training data 35. The training data is 
applied to the computer apparatus. That is, a corresponding character string representing the 
known DNA sequence is input to the invention software 21. Invention software 21 applies this 
training data character string to the expert systems 15. The expert systems 15 each 
determine/predict preliminary exons/introns 19 in the training data 35. The preliminary 
exon/intron predictions 19 from the expert systems 15 are fed into the Bayesian network 
combiner 17. In turn, the Bayesian network combiner 17 combines the preliminary exon/intron 
predictions 19 in a manner consistent with the known genes (exons/introns locations and 
pattern). That is, the Bayesian network combiner 17 is trained to make the combination of 
preliminary exon/intron predictions produce the known exons on output. In this way, the 
Bayesian network combiner 17 is said to be trained on the training data 35. 



Please replace the paragraph at page 10, lines 7 through 18 with the following paragraph: 
Bayesian network probabilistic models provide a flexible and powerful framework for 
statistical inference as well as learning of model parameters from data. The goal of inference is to 
find a distribution of one or more random variables in the network conditioned on evidence 
(known values) of other variables. Bayesian networks encompass efficient inference algorithms, 
such as Jensen's junction tree (Jensen, F.V., An Introduction to Bayesian Networks, Spring- 



Verlag, 1995) or Pearl's message passing (Pearl, J., Probabilistic reasoning in intelligent systems, 
Morgan Kaufmann, San Mateo, CA 1998). Inside a learning loop, such algorithms maybe used 
to efficiently estimate optimal values of a model's parameters from data (for instance, see Jordan, 
M.I. ed., Learning in Graphical Models, Kluwer Academic Publishers, 1998). Furthermore, 
^ techniques exist that can optimally determine the topology of a Bayesian network together with 
its parameters directly from data. 



Please replace the paragraph at page 1 1, lines 9 through 18 with the following paragraph: 
Gene combiner parameters, probability tables P(Ej|Y) and P(Y), are learned from a 
training dataset of nucleotide sequences by statistically calculating P(Ej|Y) and P(Y) of all 
individual predictors E { and labeled for ground truth Y. For instance, a maximum likelihood 
(ML) estimate of these parameters for a training set of N nucleotides is 



where e denotes the prediction of an expert system i,ee {intron,exon}, and y is the combined 
prediction, y e {intron,exon}. # E^e, Y=y denotes the number of cases in the training dataset 
where the prediction of expert system i is e and the combined prediction is y. Alternative 
estimates of these parameters may be obtained using MAP (maximum a posteriori) estimation. 



Please replace the paragraph at page 12, lines 12 through 16 with the following 
paragraph: __ 

It can be easily shown that the performance of the full Bayes model 31 is at least as good 
as that of the best individual expert system 25, 27, 29. Furthermore, the previously used AND, 
OR and majority models are special cases of the full Bayes combiner 31 . Nevertheless, this 
model 31 still assumes that the annotation of a particular nucleotide is independent of the 
annotation of any other nucleotide in the sequence. 



Please replace the paragraph at page 12, lines 22 through 24 with the following 
paragraph: 



Index t in Figure 4 refers to the sample at location t in the subject nucleotide sequence. 
For instance, the sequence Y t l ,Y\Y t+1 gives the combined annotation for the subject nucleotide 
sequence n positions t-1, t, and t+1 respectively. 



A 



Please replace the paragraph at page 14, lines 6 through 13 with the following paragraph: 

For that purpose, Applicants assumed that each individual expert system provides the 
following binary decision. An expert system produces a single labeling for every nucleotide in a 
sequence: E if the nucleotide is a part of an exon and I if it belongs to an intron or an intergenic 
region. Using the notation of Applicants 1 models, E; e {E, 1} for an expert i. Similarly, a 
combined decision Y is either E or I. Parameters of each of the four above-discussed models of 
Bayesian network combiners 28, 31, 40, 51 were learned using a standard maximum likelihood 
estimation in the Bayesian network framework. All prediction results were then obtained using a 
five-fold cross-validation. 



Please replace the paragraph at page 15, lines 7 through 12 with the following paragraph: 
An exon is said to be exactly predicted 47 only if both its ending and beginning points 

coincides with that of a true exon. An exon is said to be missed 57 if there is no overlap with any 
£ of the predicted exons. ME gives the percentage of missed exons 57 whereas WE gives the 

percentage of wrongly or overpredicted exons 49. To compute these two numbers (ME and 

WE), Applicants look for any overlap between a true and a predicted exon. Wrong exon (WE) 

prediction implies the prediction has no overlap with a true exon. 



Amendments to the specification are indicated in the attached "Marked Up Version of 
Amendments" (pages i - vii). 



Please amend Claims 4 and 10. Amendments to the claims are indicated in the attached 
"Marked Up Version of Amendments" (page vii-viii). 



4. (Amended) Computer apparatus as claimed in Claim 3 wherein the Bayesian network 
combines the predicted gene locations according to 
Y* = max Yt P(Y t |E 1 ,...,E n , Y* M ) 
Ei 6 {E, 1} 

where t is location in the subject genomic sequence and E lv ..,E n are the respective 
predictions (E for exon or I for intron or intergenic region) made by individual units 1 
through n, n being the number of units in the plurality. 

10. (Amended) A method as claimed in Claim 9 wherein the step of combining using a 
Bayesian network combines according to 
Y*=max yt P(Y t |E lv ..,E n) Y*.,) 
^ 6 {E, 1} 

where t is location in the subject genomic sequence and E ls ...,E n are the respective 
predictions (E for exon or I for intron or intergenic region) made by individual expert 
systems, n being the number of expert systems in the plurality. 



